Journal of Open Source Software — Latest Matching Preprints

1

DigitalPedon: A Novel Digital Twin Framework for Soil Profile Monitoring and Global Soil Data Interoperability

Youssef, A.; Badreldin, N.

2026-05-08 bioengineering 10.64898/2026.05.05.722891 medRxiv

Top 0.1%

13.0%

The Digital Pedon (DP) is an open-source Python framework that represents a soil profile as a continuously updated digital twin, bridging three persistent gaps in soil science: disconnected models and observations, cross-database interoperability, and the inference gap between raw sensor signals and agronomically meaningful variables. Integrating real-time sensor streams, model-based solver chains (Model-Zoo), GLOSIS-compliant ontology mapping, and a novel LLM agentic interface layer enabling natural language soil queries, the DP supports applications spanning precision agriculture, digital soil mapping, and environmental sustainability assessment. Four proof-of-concept experiments confirm automatic profile initialisation fidelity, solver chain consistency, ontology compliance, and user-defined solver extensibility.

2

golgi: open-source software for automated nerve model generation and recruitment simulation

Lung, D.; Jia, Y.; Moro, A.; Fachino, M.; Haberbusch, M.

2026-07-13 bioengineering 10.64898/2026.07.10.737846 medRxiv

Top 0.1%

12.2%

golgi is an open-source platform that takes a peripheral nerve from image to stimulated fiber population through a single graphical interface, with an equivalent scriptable Python API and command-line interface for batch and high-performance use. It integrates promptable image segmentation, automated multi-region tetrahedral meshing, anisotropic finite-element solution of the extracellular field with an explicit perineurium contact impedance, generation of realistic fiber populations and their three-dimensional trajectories, and biophysical activation thresholds through interchangeable backends: NEURON (via PyFibers) and a GPU-accelerated surrogate (AxonML). Every study exports as an integrity-hashed bundle whose image-to-recruitment provenance is verifiable byte-for-byte. golgi lowers the barrier to in-silico peripheral nerve stimulation modeling for experimentalists and clinicians, using a fully open finite-element stack with no commercial dependencies.

3

BaSiCPy: Scalable and Robust Shading Correction for Optical Microscopy Images

Liu, Y.; Fukai, Y. T.; Cano-Muniz, S.; Perez, V.; Todorov, M.; Ortega, G.; Morello, T.; Loeffler, D.; Paetzold, J.; Xu, X.; Lamm, L.; Ma, N.; Erturk, A.; Schroeder, T.; Boeck, L.; Schapiro, D.; Schaub, N.; Marr, C.; Peng, T.

2026-05-01 bioengineering 10.64898/2026.04.28.721386 medRxiv

Top 0.1%

7.3%

Quantitative fluorescence microscopy is frequently confounded by spatially varying illumination and temporal intensity drift. Although BaSiC is a widely adopted retrospective correction method, it can fail when foreground content is strongly correlated across images (a common regime in time-lapse, tiled and volumetric acquisitions), and its application often requires manual parameter tuning that limits reproducibility and scalability. We introduce BaSiCPy, a foreground-aware implementation of BaSiC that improves illumination profile estimation under correlated foreground structures, provides automatic hyperparameter selection and accelerates large-scale processing through GPU support. BaSiCPy is distributed as an open-source Python package with graphical and programmatic interfaces, facilitating integration into contemporary bioimage analysis workflows.

4

CICADA: A unified framework for NWB-based neurophysiological data analysis

Hamon, M.; Lebert, J.; Denis, J.; Filippi, C.; Renard, A.; Bech, P.; Pulin, M.; Bisi, A.; Molinuevo Gomez, D.; Priestley, J. B.; Crochet, S.; Petersen, C. C.; Cossart, R.; Picardo, M. A.; Dard, R. F.

2026-07-08 neuroscience 10.64898/2026.07.03.736318 medRxiv

Top 0.1%

5.3%

Neurophysiology datasets are becoming increasingly complex, combining behavioral measurements with high-dimensional neuronal activity recordings coming from optical and/or electrophysiological measurements. The Neurodata Without Borders (NWB) standard has emerged in the community as the format of record. While standardized and widely used preprocessing tools generating NWB files have been developed, extensible frameworks for scientific analysis downstream of the NWB ecosystem are still under-represented. We present CICADA, a Python framework dedicated to analysis of neurophysiological data in the standardized NWB format. The toolbox is built as three hierarchically-organized packages: cicada-nwb (NWB access layer), cicada-analysis (plugin-based analysis engine and tool library), and cicada-gui (PyQt5 desktop application at the head of the pipeline). Beyond this architectural separation, CICADA is built around a central design principle: supporting a continuum from turnkey use to full modularity. Researchers can use the complete GUI-driven cicada-gui workflow without writing code, programmatically use existing analysis plugins from cicada-analysis, contribute to new analysis plugins, reuse utilities from cicada-tools, or build entirely custom pipelines on top of the cicada-nwb access layer alone. The same analysis plugin runs identically in interactive GUI and parameter-configured headless modes, enabling reproducible multi-session, multi-animal group analyses. We illustrate the versatility of CICADA with example analyses of behavioral, calcium imaging (two-photon and widefield) and extracellular electrophysiology datasets from rodent laboratories. CICADA is open source, actively maintained, and designed so that any laboratory can contribute at any level of the stack without modifying the core framework.

5

Reproducible and shareable bioinformatics pipelines from natural-language prompts

Kim, H.-M.; Jeong, H.; Mekonnen, A. M.; Kim, Y.; Oh, Y.; Lee, H.; Jung, C.; Park, J.

2026-06-01 bioinformatics 10.64898/2026.05.28.719125 medRxiv

Top 0.1%

4.8%

Large language models (LLMs) are increasingly used to generate bioinformatics pipelines and to carry out analyses from natural-language prompts. However, the resulting analyses are often difficult to reproduce across sessions, owing to the non-deterministic nature of LLM-driven conversations and the heterogeneity of local execution environments, and cannot run on remote high-performance computing (HPC) servers or be shared and reused. We present Autopipe, a platform that guides any Model Context Protocol (MCP)-compatible LLM to produce, execute, and publish source-preserved, re-executable containerized pipelines. Autopipe enables users to execute bioinformatics pipelines on any on-premises remote server, supported by comprehensive setup documentation aimed at researchers without prior server-administration experience, and to visualize results through an extensible web-based viewer. The Autopipe platform comprises four components: a desktop application with an embedded MCP server for pipeline management and remote execution, an online registry for pipeline and plugin discovery, a web-based result viewer, and a CLI tool for customizing viewer plugins. Autopipe turns conversational analysis into re-executable and shareable workflows. Autopipe is freely available at https://autopipe.org/.

6

Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2

Sato, Y.

2026-05-12 bioinformatics 10.64898/2026.05.06.723320 medRxiv

Top 0.1%

4.8%

Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.

7

SaVanache: indexing and visualizing pangenome variation graphs

Mohamed, M.; Durant, E.; Rouard, M.; Muller, C.; Monat, C.; Conte, M.; Sabot, F.

2026-05-08 bioinformatics 10.64898/2026.05.05.722901 medRxiv

Top 0.1%

4.8%

With the rapid increase in genome sequencing and the growing availability of genomic resources, genomics is shifting toward pangenome representations that capture intra- and inter-specific diversity by integrating multiple genomes into a single entity. These pangenomes are increasingly modeled as graphs, encoding complex genomic variations in structures such as de Bruijn or variation graphs. However, while genome browsers provide standard and effective solutions for visualizing single or limited numbers of genomes, equivalent interactive tools for graph-based pangenomes remain limited, particularly for variation graph models. We developed SaVanache, a multi-resolution visualization interface designed to explore pangenome variation graphs at various depths. SaVanache enables the exploration of both global diversity and structural variations (SVs) across genomes relative to a user-defined linear pivot genome. Unlike synteny viewers, SaVanache emphasizes variations by representing SV types through a dedicated set of glyphs, facilitating intuitive one-to-many comparisons. To support smooth exploration, SaVanache preprocesses a Graphical Fragment Assembly (GFA) pangenome file into optimized index and data structures, enabling fast, real-time queries on large pangenome graphs. By combining advanced visualization techniques with efficient data handling, SaVanache provides a robust tool for scientists to analyze and visualize genetic variation within genomes and pangenomes, facilitating the identification of genetic determinants associated with phenotypes of interest and fully exploiting current genomic resources.
Author summary: We introduce SaVanache, an innovative tool that transforms the way we explore genomic resources. SaVanache allows visualization and analysis of pangenome variation graphs (PVGs), which capture genomic diversity by integrating structural variants (SVs) and single nucleotide polymorphisms (SNPs) across multiple genomes. Unlike traditional genome browsers limited to a few genomes, SaVanache offers a multi-level, user-friendly interface that allows users to explore from whole pangenomes down to individual structural variants, enabling multidimensional research and development. Using a linear pivot genome as a visual reference, SaVanache simplifies complex PVG structures into intuitive comparisons. It efficiently handles large datasets and speeds up data retrieval through internal parsing. The front-end, built with modern JavaScript frameworks, provides interactive and responsive visualization, while the Python/Django backend supports real-time data updates. Users can detect and classify SVs by comparing syntenic segments between genomes, visualized through a novel glyph-based system that uses shapes and colors to represent complex rearrangements. SaVanache supports seamless zooming from chromosome-wide to nucleotide-level views, interactive diversity scatterplots, dynamic pivot genome switching, and grouping genomes by metadata to explore genotype-phenotype links. In addition, export functions bridge visualization with downstream bioinformatics. Developed with user feedback, SaVanache balances biological relevance and computational efficiency, overcoming PVG complexity to empower users with unprecedented insight into genomic diversity and SVs.

8

PolyFold: Evaluation of Open-Use Molecular Structure Prediction Algorithms to Inform Their Utility in Diverse Biological Applications

Stephenson, H.; Voicu, D.; Novakov, V.; Levy, M.; Marsilio, J.

2026-06-16 bioengineering 10.64898/2026.06.16.732304 medRxiv

Top 0.1%

4.7%

With the growing use of machine-learning-assisted pipelines for designing, characterizing, and optimizing biomolecules, the reliability of structure prediction models is increasingly important. PolyFold is a benchmarking framework developed to evaluate open-use structure prediction models, Boltz-2 and OpenFold 3, as commercially accessible alternatives to AlphaFold 3. We outline an end-to-end workflow automation tool to streamline input file creation, batch automation, and comprehensive analysis of model outputs for leading open-use structure prediction models. We curated an evaluation dataset of several thousand high-quality Protein Data Bank structures, homology-filtering against the training sets of both models to ensure a fair analysis. We then implemented an evaluation pipeline incorporating structural metrics (RMSD, TM-score, lDDT, etc.), interface metrics (DockQ, ilDDT, iRMSD, etc.), and physicochemical realism checks (based on bond lengths, angles, molecular internal energies, etc.). We identify key performance disparities, observing that Boltz-2 is generally superior to OpenFold 3, though the differential is partially attributable to residual homology leakage not accounted for by prevailing test set curation practices. We thus recommend a new homology-reduction method for building test sets, one that uses length-weighted average fractional identity cutoffs rather than lowest-chain fractional identity cutoffs. Even after residual leakage is eliminated, Boltz-2 still performs better on full-set comparisons and a variety of important partitions (nucleic acids, protein-ligands, Ab-Ags, etc.). Both models are strong at folding monomeric structures, though both struggle with homomultimer placement and small molecule physical realism, demonstrating enduring limitations of machine learning methods. This work is the first end-to-end, open-use, and reproducible platform for systematically assessing state-of-the-art structure prediction models.
PolyFold enables practitioners to determine how models compare in performance on specific inference tasks and supports the broader adoption of accessible computational tools to facilitate biomolecular science.
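The contrast between the two cutoff criteria named in the PolyFold abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the example chain lengths and identities, and the choice of residue count as the weight are all hypothetical, since the abstract does not give the exact formula.

```python
from typing import Sequence


def lowest_chain_identity(identities: Sequence[float]) -> float:
    """Minimum fractional identity over the chains of one complex."""
    if not identities:
        raise ValueError("identities must be non-empty")
    return min(identities)


def length_weighted_identity(lengths: Sequence[int],
                             identities: Sequence[float]) -> float:
    """Average fractional identity, weighting each chain by its length.

    Assumption: the weight is the chain length in residues.
    """
    if not lengths or len(lengths) != len(identities):
        raise ValueError("lengths and identities must be non-empty and equal in size")
    total = sum(lengths)
    return sum(n * f for n, f in zip(lengths, identities)) / total


# Hypothetical complex: one long chain that closely matches a training
# structure, plus one short chain with little similarity to anything.
lengths = [400, 20]
identities = [0.95, 0.10]

print("lowest chain identity:   ", lowest_chain_identity(identities))
print("length-weighted identity:", length_weighted_identity(lengths, identities))
```

In this made-up case the short, dissimilar chain pulls the lowest-chain value far below the length-weighted value, which is the kind of gap through which residual homology could plausibly slip past a lowest-chain cutoff.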

9

PhyloZoo: a unified framework for phylogenetic network analysis in Python

Holtgrefe, N.

2026-06-11 bioinformatics 10.64898/2026.06.09.731120 medRxiv

Top 0.1%

4.1%

Reticulate evolutionary processes (events in which lineages merge, such as hybridization, recombination, and horizontal gene transfer) are widespread across nature but cannot be represented by phylogenetic trees alone. Phylogenetic networks have therefore become an important modelling tool, yet existing software is typically tied to specific inference paradigms and provides limited support for working with multiple network representations in a unified and programmable environment. PhyloZoo is an open-source Python framework that lowers the barrier to developing practical, easy-to-use software for phylogenetic network analysis. It provides data structures and algorithms covering the main representations used in the field, together with dedicated visualization tools and robust I/O for all major phylogenetic file formats. A particular emphasis lies on semi-directed phylogenetic networks, which explicitly represent root uncertainty and have so far received limited support in existing software. By offering a shared foundation for developing interoperable tools and a combinatorial layer that supports computational proofs and theoretical exploration, PhyloZoo enables reproducible workflows for applied, methodological, and theoretical studies of reticulate evolution.
Availability and implementation: PhyloZoo is implemented in Python and installable from PyPI, with source code, documentation, and examples available at https://github.com/nholtgrefe/phylozoo.
Contact: n.a.l.holtgrefe@tudelft.nl

10

AutoZyme: An Autonomous Agentic Framework to Optimize Bioinformatics Software

Xie, E.; Cheng, L.; Cai, Y.; Shireman, J.; Kendziorski, C.

2026-06-16 bioinformatics 10.64898/2026.06.12.731250 medRxiv

Top 0.1%

4.1%

Performance bottlenecks in widely used genomics and bioinformatics software present a substantial and growing burden as biological datasets continue to increase in size and number. Relieving these bottlenecks relies largely on expert manual optimization and therefore remains difficult to scale. Here we present AutoZyme, an agentic framework for scientific software optimization. Given a target function, AutoZyme builds benchmarks, identifies bottlenecks, and iteratively tests code changes, retaining only those that improve runtime while preserving output. We evaluated AutoZyme on 45 functions, improving runtime without substantial memory increases in over 95% of cases considered. Across 38 functions from Seurat, Scanpy and related packages in genomics and bioinformatics, AutoZyme reduced runtime by a median of 8.52-fold, with the largest reductions exceeding 676-fold. The optimized functions are distributed through AutoZyme-Library as drop-in replacements for existing analysis pipelines. We also release AutoZyme as a reusable framework for optimizing additional user-specified packages and functions.
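The acceptance rule described in the AutoZyme abstract (keep a change only if it improves runtime while preserving output) can be sketched in a few lines. This is not AutoZyme's code: the function names, the best-of-N timing helper, and the toy candidate functions are all hypothetical, and a real harness would compare outputs on many inputs rather than one.

```python
import time
from typing import Any, Callable, Sequence, Tuple

Func = Callable[..., Any]


def best_runtime(func: Func, args: Tuple[Any, ...], repeats: int = 5) -> float:
    """Best-of-N wall-clock time for func(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def select_fastest_correct(
    baseline: Func,
    candidates: Sequence[Func],
    args: Tuple[Any, ...],
    measure: Callable[[Func, Tuple[Any, ...]], float] = best_runtime,
) -> Func:
    """Return the fastest candidate whose output equals the baseline's output."""
    reference = baseline(*args)
    kept, kept_time = baseline, measure(baseline, args)
    for candidate in candidates:
        if candidate(*args) != reference:
            continue  # output changed: reject, however fast it is
        elapsed = measure(candidate, args)
        if elapsed < kept_time:
            kept, kept_time = candidate, elapsed
    return kept


# Toy stand-ins for a target function and two proposed rewrites.
def sum_squares_loop(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n: int) -> int:
    return (n - 1) * n * (2 * n - 1) // 6  # closed form for 0^2 + ... + (n-1)^2


def sum_squares_wrong(n: int) -> int:
    return n * n  # fast but incorrect, so it must be rejected


chosen = select_fastest_correct(
    sum_squares_loop, [sum_squares_wrong, sum_squares_formula], (200_000,)
)
print("kept implementation:", chosen.__name__)
```

The `measure` argument is injectable so that the selection logic can be checked with fixed timings, independent of machine noise.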

11

Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp

Landis, J. T.; Love, M. I.

2026-05-11 bioinformatics 10.64898/2026.05.06.721669 medRxiv

Top 0.1%

3.9%

Manipulating high-dimensional omics data, such as bulk or single-cell gene expression count matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or column-wise grouped operations. This grammar defines multiple contexts and provides pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxp's efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.

12

OpenEvo: An Open-Source Platform for Automated Evolution and Analysis

Cocioba, S. S.; Huang, P.-C.; Mallon, J.; Chan, Z.; Geremew, A. W.; Bisson, A.; Kyriakakis, P.

2026-07-07 bioengineering 10.64898/2026.07.06.735356 medRxiv

Top 0.1%

3.3%

Here we introduce OpenEvo, a fully open-source, low-cost turbidostat platform for automated continuous culture and directed evolution experiments. Existing tools are expensive, complex, or lack open-source hardware; OpenEvo addresses this gap. OpenEvo is a complete, fully automated evolution platform with detailed, illustrated construction instructions for beginners, open-source software and firmware, and a single device priced around $300. An optional PC-based version offers enhanced functionality, including remote access, programmable evolution cycles, programmable LED stimulation, and a data visualization tool. OpenEvo can cycle through three types of media for positive, negative, and neutral selection conditions, supporting a wide range of experimental designs. We validate the use of OpenEvo by evolving H. volcanii to grow from 15% to 12% salt over ~150 cycles, ~1,000 hours. Evolved cells grew 36% faster than wild-type at 12% salt. Whole-genome sequencing of adapted cells found SNPs and large deletions. We also demonstrate positive and negative selection using the OpenEvo LEDs to drive optogenetics via a Phytochrome B-based optogenetic tool, with light as the selection stimulus during over 4000 hours of growth. OpenEvo lowers the technical and cost barriers for continuous evolution experiments, serves as a teaching tool, and is designed to grow an open community of users who share modifications.

13

Cellfoundry: a GPU-accelerated, multi-physics ABM framework for cellular microenvironment and organoid-scale studies

Borau, C.; Chisholm, R.; Richmond, P.

2026-04-25 bioengineering 10.64898/2026.04.22.720218 medRxiv

Top 0.1%

3.3%

Advanced in vitro systems such as organoids and microfluidic organ-on-a-chip platforms enable physiologically richer experimentation, but their complexity creates large parameter spaces and makes it difficult to disentangle the mechanistic roles of transport, mechanics, and extracellular microstructure. Agent-based modelling provides a natural computational counterpart to these systems by representing heterogeneous cells as discrete entities coupled through local rules and environmental fields. However, realistic microenvironment models often remain limited by scalability, simplified extracellular matrix representations, and the practical difficulty of calibrating large numbers of parameters. Here we present Cellfoundry, a computational framework built on a FLAMEGPU2-based modelling template for simulating complex cellular microenvironments. The framework integrates multiple interacting agent populations, including cells, fibrous networks, and focal adhesions mediating attachment dynamics and traction-force transmission. It combines mechanically resolved cell-cell and cell-matrix interactions with multi-species diffusion fields that propagate biochemical signals through the extracellular environment and regulate processes such as metabolism, migration, and cell-cycle progression. Cellfoundry also supports customizable behaviours across multiple cell types, enabling the study of heterogeneous multicellular systems within a unified computational setting. To support reproducible model development and calibration, the framework includes a fibre-network generation module, automated performance benchmarking workflows, post-processing and reporting utilities, and an Optuna-based Bayesian optimization pipeline with configurable single- and multi-objective targets. 
Two showcase examples illustrate these capabilities: a migration assay calibrated against fibroblast motility descriptors and a multi-objective organoid growth scenario reproducing target population composition and expansion dynamics over time. Together, these examples demonstrate how Cellfoundry can be used to build, calibrate, and extend mechanistically interpretable models of coupled biochemical and mechanical dynamics in advanced in vitro systems.
Highlights:
- Highly versatile, GPU-accelerated agent-based framework for cellular microenvironments
- Explicit fibrous ECM networks with dynamic remodelling and focal adhesion agents
- Coupled mechanics and multi-species diffusion regulate cell behaviour in a highly customizable environment
- Modular architecture with automated benchmarking and Bayesian parameter optimization

14

SwiftNJ: Fast Exact Neighbour Joining via Correctness-Gated Coding Agents

Christensen, J.

2026-05-29 bioinformatics 10.64898/2026.05.28.728410 medRxiv

Top 0.1%

3.2%

The capability profile of frontier coding agents in 2026 varies sharply across technical domains, motivating domain-specific empirical study of where, and under what oversight conditions, such systems can contribute to specialised technical work. This paper presents one such study in computational phylogenetics. Neighbour joining (NJ) is a widely used distance-based method for inferring evolutionary trees in microbial epidemiology, comparative genomics, and large-scale sequence clustering. Its constant-factor runtime is set by hand-tuned native implementations; RapidNJ is a widely cited representative of that class and serves here as the comparison baseline. We ask whether a current-generation coding agent, operating under a correctness-gated optimisation harness with deterministic correctness gates calibrated against a QuickTree reference, can advance that constant factor on a fixed benchmark. The resulting implementation, SwiftNJ, achieves a geometric-mean runtime ratio of 0.565 against a locally rebuilt RapidNJ-native binary across a 59-matrix corpus, and is below parity on 58 of 59 matrices. On 400 shuffled inputs drawn from 16 small matrices (n ≤ 2000), SwiftNJ matched the QuickTree reference at Robinson-Foulds distance zero. In this domain, a correctness-gated coding agent meaningfully improved on a strong native baseline, suggesting that harness-guided optimisation holds promise for performance-critical bioinformatics tools; further work is needed to establish how broadly the approach generalises.
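For readers unfamiliar with the summary statistic quoted in the SwiftNJ abstract, a geometric-mean runtime ratio and a below-parity count can be computed as in the sketch below. The timings are invented for illustration and are not the paper's data; the helper names are hypothetical.

```python
import math
from typing import Sequence


def runtime_ratios(candidate_s: Sequence[float],
                   baseline_s: Sequence[float]) -> list:
    """Per-benchmark ratio candidate / baseline; values below 1 mean faster."""
    if len(candidate_s) != len(baseline_s):
        raise ValueError("timing lists must have the same length")
    return [c / b for c, b in zip(candidate_s, baseline_s)]


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean via the mean of logarithms (all values must be > 0)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Invented timings in seconds for three benchmark matrices.
candidate = [0.5, 2.0, 1.0]
baseline = [1.0, 1.0, 4.0]

ratios = runtime_ratios(candidate, baseline)
below_parity = sum(1 for r in ratios if r < 1.0)

print("ratios:", ratios)
print("geometric-mean ratio:", geometric_mean(ratios))
print("below parity on", below_parity, "of", len(ratios))
```

The geometric mean is the usual choice for averaging ratios because a 2x slowdown and a 2x speedup cancel out, which an arithmetic mean would not reflect.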

15

NeuVue: A scalable and customizable framework for electron microscopy proofreading

Xenes, D.; Kitchell, L. M.; Rivlin, P. K.; Martinez, H.; Rose, V.; Bishop, C.; Brodsky, R.; Celii, B.; Ellis-Joyce, J.; Luna, D.; Norman-Tenazas, R.; Ramsden, D.; Romero, K.; Villafane-Delgado, M.; Collman, F.; Gray-Roncal, W.; Reimer, J.; Wester, B.

2026-05-12 neuroscience 10.1101/2022.07.18.500521 medRxiv

Top 0.1%

3.2%

Connectomic reconstruction from large image volumes produces segmentation and synaptic-assignment errors that must be resolved to support downstream analyses. As datasets have grown larger and teams more distributed, proofreading has become a critical operational bottleneck. Workflows for proofreading and error correction have not scaled commensurately with connectomic data production and may not accommodate heterogeneous proofreader expertise and machine-generated candidate edits. New tools are therefore needed to organize, prioritize, and coordinate proofreading at volume scale. Here we present NeuVue, a task-management and prioritization framework that operationalizes proofreading through atomic, auditable tasks for individual and team review, multistage routing across proofreader cohorts, performance and volume-state tracking, and integration with community annotation, visualization, and analysis services. We report the use of NeuVue across two volumetric datasets, supporting scalable proofreading by over forty proofreaders and producing over fifty thousand edits. NeuVue provides a reproducible human-in-the-loop framework for generating, validating, and maintaining large connectomic datasets.

16

cran2crux: automatically create CRUX ports for R-packages

Petrov, P.; Izzi, V.

2026-05-13 bioinformatics 10.64898/2026.05.09.723963 medRxiv

Top 0.1%

3.2%

Motivation: R together with CRAN and Bioconductor provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R packages with its native package manager.
Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX.
Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.

17

A motif for domain-specific analysis applets that are easy to learn, reuse, test, and to compose into pipelines: application to vision science

Lepsky, A. A.; Severson, M. K.; Wang, R.; Cheng, X.; Rodriguez, R. L.; Gong, R.; Van Hooser, S. D.

2026-04-30 neuroscience 10.64898/2026.04.27.721136 medRxiv

Top 0.1%

3.0%

Scientific progress depends on the analysis of primary data, yet the small, domain-specific programs that perform most scientific analyses are typically poorly documented, narrowly tested, and difficult to reuse outside the lab that created them. General-purpose pipeline tools address the problem of running steps in order but do not enforce documentation, testing, or standardized outputs. We describe a motif for building domain-specific analysis applets, which we call calculators, that constrains developer choices in order to produce code that is readable, tested, and reusable almost as a byproduct of following the template. Calculators operate on a typed, searchable database of documents, eliminating the need to explicitly wire inputs and outputs together; instead, each calculator searches the database for documents it can operate on and adds its results as new typed documents. Calculators must provide documentation in a standard location, self-tests that can be run and inspected interactively, adjustable input parameters, a single well-defined output document type, and a default plotting method. Sets of calculators compose naturally into pipelines whose outputs satisfy FAIR principles at every stage. We demonstrate the motif by implementing calculators for common analyses in vision science, including orientation and direction selectivity, contrast tuning, spatial and temporal frequency tuning, speed tuning, and Hartley reverse correlation. These calculators have been used in published work and are in active use across collaborating laboratories. We discuss the design principles of the motif, its advantages and limitations, and its applicability to domain-specific computation across neuroscience and beyond.
Significance Statement: Scientists often must write small programs to analyze their own data. These programs are usually poorly documented, lightly tested, and hard for other labs to reuse. Mistakes in this kind of code have even caused well-known papers to be retracted. We describe a simple pattern for writing these programs, which we call a calculator. The pattern requires the programmer to include clear documentation, built-in tests, adjustable settings, and a standard form of output. Calculators work by searching a shared database for data they know how to handle, so many calculators can be chained together into a pipeline without extra setup. We show how this works by building calculators for common visual neuroscience analyses that other labs are already using.
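The calculator motif described above (search a typed document database for usable inputs, then add results as new typed documents) might look roughly like the following sketch. Every name here is hypothetical, including `Document`, `Database`, and `ModulationIndexCalculator`; the modulation index is a stand-in analysis rather than one of the authors' calculators, and the default plotting method the motif requires is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    doc_type: str
    data: Dict[str, Any]


@dataclass
class Database:
    documents: List[Document] = field(default_factory=list)

    def search(self, doc_type: str) -> List[Document]:
        """Return all documents of the requested type."""
        return [d for d in self.documents if d.doc_type == doc_type]

    def add(self, doc: Document) -> None:
        self.documents.append(doc)


class ModulationIndexCalculator:
    """Computes (max - min) / (max + min) over a list of responses.

    Input document type:  "tuning_curve" with data["responses"]
    Output document type: "modulation_index" with data["index"]
    """

    input_type = "tuning_curve"
    output_type = "modulation_index"

    def __init__(self, parameters: Optional[Dict[str, Any]] = None) -> None:
        # Adjustable input parameters with defaults.
        self.parameters = {"zero_sum_value": 0.0}
        self.parameters.update(parameters or {})

    def calculate(self, doc: Document) -> Document:
        responses = doc.data["responses"]
        hi, lo = max(responses), min(responses)
        denom = hi + lo
        index = (hi - lo) / denom if denom else self.parameters["zero_sum_value"]
        return Document(self.output_type, {"index": index})

    def run(self, db: Database) -> int:
        """Find every input this calculator can handle and store the results."""
        inputs = db.search(self.input_type)
        for doc in inputs:
            db.add(self.calculate(doc))
        return len(inputs)

    def self_test(self) -> bool:
        """Built-in check on a case with a known answer."""
        result = self.calculate(Document(self.input_type, {"responses": [1.0, 3.0]}))
        return result.data["index"] == 0.5


db = Database([Document("tuning_curve", {"responses": [1.0, 3.0]})])
calc = ModulationIndexCalculator()
print("self-test passed:", calc.self_test())
print("documents processed:", calc.run(db))
```

Because the calculator discovers its inputs by type, a second calculator that consumes "modulation_index" documents could be run afterwards with no explicit wiring between the two.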

18

SPACKLE: A spatial-first framework for multi-layer spatial transcriptomic analysis

Maynard, T. M.

2026-05-29 bioinformatics 10.64898/2026.05.26.727917 medRxiv

Top 0.1%

2.6%

Background: The emergence of accessible spatial transcriptomic platforms such as 10x Genomics Visium HD and Xenium has created demand for analysis tools that can handle the complexity and scale of spatial datasets. Current frameworks approach spatial data primarily as an extension of single-cell RNA-seq pipelines, where spatial coordinates are retained as metadata rather than treated as a first-class organizing principle. As a result, common tasks such as multi-modal data alignment, region-of-interest selection, and cross-resolution visualization require manually managing disparate data types, coordinates, and scales, making spatial analysis unnecessarily time-consuming and error-prone.
Results: We present SPACKLE (Spatial Platform for Analysis of Composite stacKs and Layered data Extraction), a Python-based "spatial-first" framework that treats absolute physical micron coordinates as the organizing principle for all data types. All data - morphology images, transcript point clouds, expression matrices, segmented cells, and user-defined regions - are stored as typed objects ("Channels") that carry their own spatial metadata, keeping all layers in automatic registration regardless of platform, resolution, or analysis operation. Two complementary interfaces simplify access to underlying data: the ViewPort, a compositing engine for efficient multi-channel visualization, and the DataPort, which extracts raw data in its native format for downstream analysis. A set of spatial analysis tools demonstrates the practical benefits of the framework, including ROI-based expression binning, cortical unfolding, and sub-micron fine alignment of transcript and image data. The use of modern Python data management methods helps maintain the efficiency of the framework, allowing for quick visualizations and analysis with a low memory footprint.
Conclusions: SPACKLE is designed to complement rather than replace widely used tools in the spatial analysis ecosystem (Scanpy, Squidpy, CellPose, StarDist), by handling the spatial mechanics of large datasets so that the analyst can focus on the biology. SPACKLE is freely available under the MIT license at https://github.com/maynardt/spackle.
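
The "spatial-first" idea in this abstract, where each layer carries its own spatial metadata and every query is phrased in absolute microns, can be sketched in a few lines of Python. The class names below (ROI, ImageChannel, PointChannel) and their fields are assumptions made for illustration only; they are not SPACKLE's actual classes or API. The sketch shows one micron-space ROI being applied to two images of different resolution and to a point cloud, with no manual rescaling by the caller.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ROI:
    """Axis-aligned region in absolute micron coordinates (hypothetical)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class ImageChannel:
    pixels: list[list[float]]                     # pixels[row][col]
    um_per_pixel: float                           # physical size of one pixel
    origin_um: tuple[float, float] = (0.0, 0.0)   # micron position of the image corner

    def crop(self, roi: ROI) -> list[list[float]]:
        # Convert micron bounds to pixel indices using this channel's own scale.
        ox, oy = self.origin_um
        c0 = max(0, math.floor((roi.x_min - ox) / self.um_per_pixel))
        r0 = max(0, math.floor((roi.y_min - oy) / self.um_per_pixel))
        c1 = max(0, math.ceil((roi.x_max - ox) / self.um_per_pixel))
        r1 = max(0, math.ceil((roi.y_max - oy) / self.um_per_pixel))
        return [row[c0:c1] for row in self.pixels[r0:r1]]


@dataclass
class PointChannel:
    points_um: list[tuple[float, float]]          # positions already in microns

    def crop(self, roi: ROI) -> list[tuple[float, float]]:
        return [(x, y) for x, y in self.points_um
                if roi.x_min <= x < roi.x_max and roi.y_min <= y < roi.y_max]


# Two images of the same 2 x 2 micron field at different resolutions.
hi_res = ImageChannel([[r * 4 + c for c in range(4)] for r in range(4)], um_per_pixel=0.5)
lo_res = ImageChannel([[10, 11], [12, 13]], um_per_pixel=1.0)
transcripts = PointChannel([(1.5, 0.5), (0.2, 0.2)])

roi = ROI(x_min=1.0, y_min=0.0, x_max=2.0, y_max=1.0)   # defined once, in microns
hi_crop = hi_res.crop(roi)
lo_crop = lo_res.crop(roi)
pts = transcripts.crop(roi)
```

The same ROI selects a 2 x 2 pixel block from the finer image and a single pixel from the coarser one, because each channel translates microns to its own index space.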

19

fuzzyfold: a high-performance framework for stochastic RNA folding kinetics

Badelt, S.

2026-06-18 bioinformatics 10.64898/2026.06.17.732885 medRxiv

Top 0.1%

2.5%

The analysis of nucleic acid secondary structures is overwhelmingly dominated by methods that analyze the thermodynamic equilibrium distribution and ignore all dynamic aspects of nucleic acid folding. Yet there are numerous popular examples of nucleic acid folding that rely on kinetic models, such as RNA riboswitches or DNA strand displacement systems. Here, I present fuzzyfold, a Rust-based software package for nucleic acid secondary structure analysis with an explicit focus on stochastic modeling. The framework introduces three-way and four-way shift moves with a biophysically motivated rate-model parameterization, and it is developed with an emphasis on both model flexibility and performance, e.g. allowing for the generation of single co-transcriptional trajectories for thousand-nucleotide-long RNA molecules in just a few minutes. The main strength of the fuzzyfold package, however, is its focus on user and developer interfaces for long-term development. It provides easily installable command-line interfaces, e.g. for efficiently aggregating data from multiple parallel trajectories into an ensemble-level dynamic analysis. For developers, the codebase supports straightforward substitution of thermodynamic and kinetic free-energy models, and a flexible library interface with Python bindings, enabling integration of individual components into custom computational workflows.

20

HydraMPP: A lightweight library for distributed massive parallel processing in Python - threading at scale.

Figueroa, J. L.; White, R. A.

2026-06-08 bioinformatics 10.64898/2026.06.04.730204 medRxiv

Top 0.1%

2.4%

We now live in an era of massive datasets from genomics and large language models, with much of humanity's knowledge right at our fingertips. Much of this data is becoming more accessible; however, processing such data remains an ongoing issue across systems, including high-performance computing (HPC) infrastructures. Massively parallel processing (MPP) addresses this with a divide-and-conquer approach, splitting workloads across independent nodes (i.e., central processing units (CPUs)) and allowing data processing to scale further. The main engine for this in Python is Ray; however, it has many issues, including a large codebase, security issues, debugging opacity, and memory management issues. Here, we present HydraMPP, a lightweight library that emphasizes ease of use, high auditability, and SLURM ergonomics.